[Feat] [history server] Add actor task endpoint #4463
JiangJiaWei1103 wants to merge 50 commits into ray-project:master from
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
…types Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
taskMap := h.ClusterTaskMap.GetOrCreateTaskMap(clusterSessionKey)
taskMap.CreateOrMergeAttempt(currTask.TaskID, currTask.TaskAttempt, func(task *types.Task) {
	// --- DEDUPLICATION using (State + Timestamp) as unique key ---
Deduplication logic can be reused from here after the node endpoint PR is merged.
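For readers following the deduplication comment above, here is a minimal sketch of merging state transitions while treating (State, Timestamp) as the unique key. The `StateEvent` type and `mergeEvents` function are hypothetical names for illustration and are not taken from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// StateEvent is a hypothetical stand-in for one task state transition.
type StateEvent struct {
	State     string
	CreatedMs int64
}

// mergeEvents appends incoming events to existing ones, skipping any event
// whose (State, CreatedMs) pair was already seen, then orders them by time.
func mergeEvents(existing, incoming []StateEvent) []StateEvent {
	seen := make(map[StateEvent]struct{}, len(existing)+len(incoming))
	merged := make([]StateEvent, 0, len(existing)+len(incoming))
	for _, batch := range [][]StateEvent{existing, incoming} {
		for _, ev := range batch {
			if _, dup := seen[ev]; dup {
				continue // same state at the same timestamp: a replayed event
			}
			seen[ev] = struct{}{}
			merged = append(merged, ev)
		}
	}
	// A stable sort keeps arrival order for events sharing a timestamp.
	sort.SliceStable(merged, func(i, j int) bool {
		return merged[i].CreatedMs < merged[j].CreatedMs
	})
	return merged
}

func main() {
	existing := []StateEvent{{"PENDING_ARGS_AVAIL", 1770211244350}, {"RUNNING", 1770211244536}}
	incoming := []StateEvent{{"RUNNING", 1770211244536}, {"FINISHED", 1770211244615}}
	for _, ev := range mergeEvents(existing, incoming) {
		fmt.Println(ev.State, ev.CreatedMs)
	}
}
```

Because the struct itself is the map key, two events only collide when both the state and the timestamp match, so distinct states that share a millisecond are both kept.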
// TODO(jwj): Support profiling_data after TASK_PROFILE_EVENT is supported.
// Ref: https://github.com/ray-project/ray/blob/d0b1d151d8ea964a711e451d0ae736f8bf95b629/python/ray/util/state/common.py#L1616-L1622.
// "profiling_data": task.ProfilingData,
Will support profiling_data after TASK_PROFILE_EVENT is handled.
1. Filter tasks with exclude_driver and filter triple
2. Limit task number by limit
3. Filter fields by detail
4. Don't support timeout since tasks are in mem

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
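The first two steps of this commit message can be sketched as follows. This is an illustrative outline only: the `Task` map, the `Filter` struct, and `listTasks` are hypothetical, the sketch assumes `exclude_driver` drops tasks whose type is `DRIVER_TASK`, and it handles only the `=` and `!=` predicates. Field trimming by `detail` is left out.

```go
package main

import "fmt"

// Task is a hypothetical, trimmed-down task record keyed by field name.
type Task map[string]string

// Filter is one (key, predicate, value) triple.
type Filter struct {
	Key, Predicate, Value string
}

// matches reports whether the task satisfies every filter.
// Unknown predicates are treated as a non-match.
func matches(t Task, filters []Filter) bool {
	for _, f := range filters {
		equal := t[f.Key] == f.Value
		switch f.Predicate {
		case "=":
			if !equal {
				return false
			}
		case "!=":
			if equal {
				return false
			}
		default:
			return false
		}
	}
	return true
}

// listTasks filters first, then truncates to limit. It returns the kept
// tasks together with the number of tasks that passed the filters.
func listTasks(all []Task, filters []Filter, excludeDriver bool, limit int) ([]Task, int) {
	kept := make([]Task, 0, len(all))
	for _, t := range all {
		if excludeDriver && t["type"] == "DRIVER_TASK" {
			continue
		}
		if matches(t, filters) {
			kept = append(kept, t)
		}
	}
	numFiltered := len(kept)
	if limit >= 0 && len(kept) > limit {
		kept = kept[:limit]
	}
	return kept, numFiltered
}

func main() {
	all := []Task{
		{"task_id": "t0", "type": "DRIVER_TASK", "state": "RUNNING"},
		{"task_id": "t1", "type": "ACTOR_TASK", "state": "FINISHED"},
		{"task_id": "t2", "type": "NORMAL_TASK", "state": "FINISHED"},
	}
	kept, numFiltered := listTasks(all, []Filter{{"state", "=", "FINISHED"}}, true, 1)
	fmt.Println(len(kept), numFiltered)
}
```

Since the tasks already sit in memory, the whole pass is a single loop, which is consistent with the commit's choice not to support a timeout.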
Live cluster: Dead Cluster: Missing
func ParseOptionsFromReq(req *restful.Request) (ListAPIOptions, error) {
	opts := ListAPIOptions{
		Limit: RayMaxLimitFromDataSource,
	}
Should we align the query parameters with Ray’s Dashboard?
Thanks for the review! The changes in 4a4ab9b aim to align the behavior as closely as possible with Ray’s Dashboard.
I think there are still a few points worth further discussion:
- limit: In Ray, users can configure a client-side limit via the RAY_MAX_LIMIT_FROM_API_SERVER environment variable. Should we consider supporting a similar mechanism in the history server?
- timeout: Since the listing methods in the history server rely on in-memory maps, queries should be fast. Would it be reasonable to ignore the timeout setting in this case?
- detail and exclude_driver: These now align with Ray’s default values.
limit: In Ray, users can configure a client-side limit via the RAY_MAX_LIMIT_FROM_API_SERVER environment variable. Should we consider supporting a similar mechanism in the history server?
Yes, we can address this together with the timeout in a follow-up. The same issue was also observed in api/v0/logs/file; we can implement this after all endpoints are completed.
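As a rough illustration of the follow-up discussed in this thread, the sketch below parses `limit`, `detail`, and `exclude_driver` from query parameters and caps `limit` at a configurable ceiling. Everything named here is an assumption: `ListOptions`, `parseListOptions`, and the `HISTORY_SERVER_MAX_LIMIT` environment variable are hypothetical, and the default values (limit 100, detail false, exclude_driver true) are placeholders rather than values confirmed by the PR.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strconv"
)

// ListOptions is a hypothetical container for the parsed query parameters.
type ListOptions struct {
	Limit         int
	Detail        bool
	ExcludeDriver bool
}

// maxLimit reads a hypothetical HISTORY_SERVER_MAX_LIMIT override,
// falling back to def when the variable is unset or invalid.
func maxLimit(def int) int {
	if v, err := strconv.Atoi(os.Getenv("HISTORY_SERVER_MAX_LIMIT")); err == nil && v > 0 {
		return v
	}
	return def
}

// parseListOptions applies placeholder defaults, then overrides them with
// any query parameters present, rejecting limits above the ceiling.
func parseListOptions(q url.Values, ceiling int) (ListOptions, error) {
	opts := ListOptions{Limit: 100, Detail: false, ExcludeDriver: true}
	if raw := q.Get("limit"); raw != "" {
		n, err := strconv.Atoi(raw)
		if err != nil || n < 0 {
			return opts, fmt.Errorf("invalid limit %q", raw)
		}
		opts.Limit = n
	}
	if opts.Limit > ceiling {
		return opts, fmt.Errorf("limit %d exceeds maximum %d", opts.Limit, ceiling)
	}
	for name, dst := range map[string]*bool{
		"detail":         &opts.Detail,
		"exclude_driver": &opts.ExcludeDriver,
	} {
		if raw := q.Get(name); raw != "" {
			b, err := strconv.ParseBool(raw)
			if err != nil {
				return opts, fmt.Errorf("invalid %s %q", name, raw)
			}
			*dst = b
		}
	}
	return opts, nil
}

func main() {
	q, _ := url.ParseQuery("limit=5&detail=true")
	opts, err := parseListOptions(q, maxLimit(10000))
	fmt.Println(opts, err)
}
```

The `timeout` parameter is deliberately not parsed, mirroring the suggestion above that it can be ignored while listings are served from in-memory maps.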
	Uris *RuntimeEnvUris `json:"uris,omitempty"`
	// The serialized runtime env config passed from the user.
	RuntimeEnvConfig RuntimeEnvConfig `json:"runtimeEnvConfig"`
}
Docs at a4604db: Note that these two fields are never populated on the Ray side.
Summary
The following demonstrates the API schema of the Live Cluster:
{
"result": true,
"msg": "",
"data": {
"result": {
"total": 4,
"num_after_truncation": 4,
"num_filtered": 4,
"result": [
{
"attempt_number": 0,
"language": "PYTHON",
"job_id": "02000000",
"events": [
{
"state": "PENDING_ARGS_AVAIL",
"created_ms": 1770211318150
},
{
"state": "PENDING_NODE_ASSIGNMENT",
"created_ms": 1770211318150
},
{
"state": "SUBMITTED_TO_WORKER",
"created_ms": 1770211318150
},
{
"state": "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY",
"created_ms": 1770211318150
},
{
"state": "RUNNING",
"created_ms": 1770211318150
},
{
"state": "FINISHED",
"created_ms": 1770211318150
}
],
"required_resources": {},
"func_or_class_name": "Counter.get_count",
"task_log_info": null,
"label_selector": {},
"name": "Counter.get_count",
"profiling_data": {
"component_type": "worker",
"component_id": "07584bda3bba1ab74b3d9a305a412d2726ce0e91e565c200f2623134",
"node_ip_address": "10.244.0.48",
"events": [
{
"start_time": 1770211318150.7988,
"end_time": 1770211318150.8,
"extra_data": {},
"event_name": "task:deserialize_arguments"
},
{
"start_time": 1770211318150.8047,
"end_time": 1770211318150.82,
"extra_data": {},
"event_name": "task:execute"
},
{
"start_time": 1770211318150.8213,
"end_time": 1770211318150.853,
"extra_data": {},
"event_name": "task:store_outputs"
},
{
"start_time": 1770211318150.784,
"end_time": 1770211318150.8577,
"extra_data": {
"name": "get_count",
"task_id": "39088be3736e590a051aa2759ceb4431ad03962e02000000"
},
"event_name": "task::Counter.get_count"
}
]
},
"end_time_ms": 1770211318150,
"state": "FINISHED",
"is_debugger_paused": null,
"call_site": null,
"worker_id": "07584bda3bba1ab74b3d9a305a412d2726ce0e91e565c200f2623134",
"type": "ACTOR_TASK",
"error_type": null,
"runtime_env_info": {
"serialized_runtime_env": "{}",
"runtime_env_config": {
"setup_timeout_seconds": 600,
"eager_install": true,
"log_files": []
}
},
"creation_time_ms": 1770211318150,
"actor_id": "051aa2759ceb4431ad03962e02000000",
"parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
"worker_pid": 241,
"placement_group_id": null,
"start_time_ms": 1770211318150,
"error_message": null,
"task_id": "39088be3736e590a051aa2759ceb4431ad03962e02000000",
"node_id": "49029ecd45f2139d33149e51b8732b5573bbca0f813e50145134d4db"
},
{
"attempt_number": 0,
"language": "PYTHON",
"job_id": "02000000",
"events": [
{
"state": "PENDING_ARGS_AVAIL",
"created_ms": 1770211317649
},
{
"state": "PENDING_NODE_ASSIGNMENT",
"created_ms": 1770211317649
},
{
"state": "SUBMITTED_TO_WORKER",
"created_ms": 1770211317969
},
{
"state": "RUNNING",
"created_ms": 1770211317970
},
{
"state": "FINISHED",
"created_ms": 1770211318053
}
],
"required_resources": {
"CPU": 0.5
},
"func_or_class_name": "my_task",
"task_log_info": {
"stdout_file": "/tmp/ray/session_2026-02-04_05-21-24_586619_1/logs/worker-2e6e9aa30b493469bbd886e115704121935592922b6ec19388373bab-02000000-242.out",
"stderr_file": "/tmp/ray/session_2026-02-04_05-21-24_586619_1/logs/worker-2e6e9aa30b493469bbd886e115704121935592922b6ec19388373bab-02000000-242.err",
"stdout_start": 36,
"stdout_end": 36,
"stderr_start": 36,
"stderr_end": 36
},
"label_selector": {},
"name": "my_task",
"profiling_data": {
"component_type": "worker",
"component_id": "2e6e9aa30b493469bbd886e115704121935592922b6ec19388373bab",
"node_ip_address": "10.244.0.48",
"events": [
{
"start_time": 1770211317971.4631,
"end_time": 1770211318052.6414,
"extra_data": {},
"event_name": "task:deserialize_arguments"
},
{
"start_time": 1770211318052.6765,
"end_time": 1770211318052.7,
"extra_data": {},
"event_name": "task:execute"
},
{
"start_time": 1770211318052.7014,
"end_time": 1770211318052.7334,
"extra_data": {},
"event_name": "task:store_outputs"
},
{
"start_time": 1770211317971.4526,
"end_time": 1770211318052.741,
"extra_data": {
"name": "__main__.my_task",
"task_id": "67a2e8cfa5a06db3ffffffffffffffffffffffff02000000"
},
"event_name": "task::my_task"
}
]
},
"end_time_ms": 1770211318053,
"state": "FINISHED",
"is_debugger_paused": null,
"call_site": null,
"worker_id": "2e6e9aa30b493469bbd886e115704121935592922b6ec19388373bab",
"type": "NORMAL_TASK",
"error_type": null,
"runtime_env_info": {
"serialized_runtime_env": "{}",
"runtime_env_config": {
"setup_timeout_seconds": 600,
"eager_install": true,
"log_files": []
}
},
"creation_time_ms": 1770211317649,
"actor_id": null,
"parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
"worker_pid": 242,
"placement_group_id": null,
"start_time_ms": 1770211317970,
"error_message": null,
"task_id": "67a2e8cfa5a06db3ffffffffffffffffffffffff02000000",
"node_id": "49029ecd45f2139d33149e51b8732b5573bbca0f813e50145134d4db"
},
{
"attempt_number": 0,
"language": "PYTHON",
"job_id": "02000000",
"events": [
{
"state": "PENDING_ARGS_AVAIL",
"created_ms": 1770211318056
},
{
"state": "PENDING_NODE_ASSIGNMENT",
"created_ms": 1770211318056
},
{
"state": "SUBMITTED_TO_WORKER",
"created_ms": 1770211318063
},
{
"state": "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY",
"created_ms": 1770211318063
},
{
"state": "RUNNING",
"created_ms": 1770211318063
},
{
"state": "FINISHED",
"created_ms": 1770211318150
}
],
"required_resources": {},
"func_or_class_name": "Counter.increment",
"task_log_info": null,
"label_selector": {},
"name": "Counter.increment",
"profiling_data": {
"component_type": "worker",
"component_id": "07584bda3bba1ab74b3d9a305a412d2726ce0e91e565c200f2623134",
"node_ip_address": "10.244.0.48",
"events": [
{
"start_time": 1770211318064.0505,
"end_time": 1770211318064.0513,
"extra_data": {},
"event_name": "task:deserialize_arguments"
},
{
"start_time": 1770211318064.055,
"end_time": 1770211318064.0867,
"extra_data": {},
"event_name": "task:execute"
},
{
"start_time": 1770211318064.0889,
"end_time": 1770211318149.785,
"extra_data": {},
"event_name": "task:store_outputs"
},
{
"start_time": 1770211318064.0417,
"end_time": 1770211318149.7983,
"extra_data": {
"name": "increment",
"task_id": "e5cbd90b7f1fb776051aa2759ceb4431ad03962e02000000"
},
"event_name": "task::Counter.increment"
}
]
},
"end_time_ms": 1770211318150,
"state": "FINISHED",
"is_debugger_paused": null,
"call_site": null,
"worker_id": "07584bda3bba1ab74b3d9a305a412d2726ce0e91e565c200f2623134",
"type": "ACTOR_TASK",
"error_type": null,
"runtime_env_info": {
"serialized_runtime_env": "{}",
"runtime_env_config": {
"setup_timeout_seconds": 600,
"eager_install": true,
"log_files": []
}
},
"creation_time_ms": 1770211318056,
"actor_id": "051aa2759ceb4431ad03962e02000000",
"parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
"worker_pid": 241,
"placement_group_id": null,
"start_time_ms": 1770211318063,
"error_message": null,
"task_id": "e5cbd90b7f1fb776051aa2759ceb4431ad03962e02000000",
"node_id": "49029ecd45f2139d33149e51b8732b5573bbca0f813e50145134d4db"
},
{
"attempt_number": 0,
"language": "PYTHON",
"job_id": "02000000",
"events": [
{
"state": "PENDING_ARGS_AVAIL",
"created_ms": 1770211318056
},
{
"state": "PENDING_NODE_ASSIGNMENT",
"created_ms": 1770211318057
},
{
"state": "RUNNING",
"created_ms": 1770211318060
},
{
"state": "FINISHED",
"created_ms": 1770211318063
}
],
"required_resources": {
"CPU": 0.5
},
"func_or_class_name": "Counter.__init__",
"task_log_info": null,
"label_selector": {},
"name": "Counter.__init__",
"profiling_data": {
"component_type": "worker",
"component_id": "07584bda3bba1ab74b3d9a305a412d2726ce0e91e565c200f2623134",
"node_ip_address": "10.244.0.48",
"events": [
{
"start_time": 1770211318062.4192,
"end_time": 1770211318062.42,
"extra_data": {},
"event_name": "task:deserialize_arguments"
},
{
"start_time": 1770211318062.424,
"end_time": 1770211318062.431,
"extra_data": {},
"event_name": "task:execute"
},
{
"start_time": 1770211318062.4353,
"end_time": 1770211318062.4365,
"extra_data": {},
"event_name": "task:store_outputs"
},
{
"start_time": 1770211318062.4097,
"end_time": 1770211318062.44,
"extra_data": {
"name": "__init__",
"task_id": "ffffffffffffffff051aa2759ceb4431ad03962e02000000"
},
"event_name": "task::Counter.__init__"
}
]
},
"end_time_ms": 1770211318063,
"state": "FINISHED",
"is_debugger_paused": null,
"call_site": null,
"worker_id": null,
"type": "ACTOR_CREATION_TASK",
"error_type": null,
"runtime_env_info": {
"serialized_runtime_env": "{}",
"runtime_env_config": {
"setup_timeout_seconds": 600,
"eager_install": true,
"log_files": []
}
},
"creation_time_ms": 1770211318056,
"actor_id": "051aa2759ceb4431ad03962e02000000",
"parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
"worker_pid": 241,
"placement_group_id": null,
"start_time_ms": 1770211318060,
"error_message": null,
"task_id": "ffffffffffffffff051aa2759ceb4431ad03962e02000000",
"node_id": null
}
],
"partial_failure_warning": "",
"warnings": null
}
}
}
Dead Cluster:
{
"result": true,
"msg": "",
"data": {
"result": {
"total": 6,
"num_after_truncation": 6,
"num_filtered": 4,
"result": [
{
"actor_id": "f6080df37a35848b2468441a02000000",
"attempt_number": 0,
"call_site": null,
"creation_time_ms": 1770211244699,
"end_time_ms": 1770211244700,
"error_message": null,
"error_type": null,
"events": [
{
"created_ms": 1770211244699,
"state": "PENDING_ARGS_AVAIL"
},
{
"created_ms": 1770211244699,
"state": "PENDING_NODE_ASSIGNMENT"
},
{
"created_ms": 1770211244699,
"state": "SUBMITTED_TO_WORKER"
},
{
"created_ms": 1770211244700,
"state": "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY"
},
{
"created_ms": 1770211244700,
"state": "RUNNING"
},
{
"created_ms": 1770211244700,
"state": "FINISHED"
}
],
"func_or_class_name": "Counter.get_count",
"is_debugger_paused": null,
"job_id": "02000000",
"label_selector": {},
"language": "PYTHON",
"name": "Counter.get_count",
"node_id": "3c5bf314a66f86ebfeee453d409b8318fa46f636a12a7a0a23feecab",
"parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
"placement_group_id": null,
"required_resources": {},
"runtime_env_info": {
"runtime_env_config": {
"eager_install": true,
"log_files": [],
"setup_timeout_seconds": 600
},
"serialized_runtime_env": "{}"
},
"start_time_ms": 1770211244700,
"state": "FINISHED",
"task_id": "39088be3736e590af6080df37a35848b2468441a02000000",
"task_log_info": null,
"type": "ACTOR_TASK",
"worker_id": "a095c2237c3c8d2bd1bd834117e4d4b89abf7fc904a1bf6cbda4af5d",
"worker_pid": 240
},
{
"actor_id": "",
"attempt_number": 0,
"call_site": null,
"creation_time_ms": 1770211244350,
"end_time_ms": 1770211244615,
"error_message": null,
"error_type": null,
"events": [
{
"created_ms": 1770211244350,
"state": "PENDING_ARGS_AVAIL"
},
{
"created_ms": 1770211244350,
"state": "PENDING_NODE_ASSIGNMENT"
},
{
"created_ms": 1770211244536,
"state": "SUBMITTED_TO_WORKER"
},
{
"created_ms": 1770211244536,
"state": "RUNNING"
},
{
"created_ms": 1770211244615,
"state": "FINISHED"
}
],
"func_or_class_name": "my_task",
"is_debugger_paused": null,
"job_id": "02000000",
"label_selector": {},
"language": "PYTHON",
"name": "my_task",
"node_id": "3c5bf314a66f86ebfeee453d409b8318fa46f636a12a7a0a23feecab",
"parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
"placement_group_id": null,
"required_resources": {
"CPU": 0.5
},
"runtime_env_info": {
"runtime_env_config": {
"eager_install": true,
"log_files": [],
"setup_timeout_seconds": 600
},
"serialized_runtime_env": "{}"
},
"start_time_ms": 1770211244536,
"state": "FINISHED",
"task_id": "67a2e8cfa5a06db3ffffffffffffffffffffffff02000000",
"task_log_info": null,
"type": "NORMAL_TASK",
"worker_id": "2f051136bf328c5e163175e6d00dd74c7f12ddafee86cdf513e1b2bf",
"worker_pid": 241
},
{
"actor_id": "f6080df37a35848b2468441a02000000",
"attempt_number": 0,
"call_site": null,
"creation_time_ms": 1770211244618,
"end_time_ms": 1770211244699,
"error_message": null,
"error_type": null,
"events": [
{
"created_ms": 1770211244618,
"state": "PENDING_ARGS_AVAIL"
},
{
"created_ms": 1770211244618,
"state": "PENDING_NODE_ASSIGNMENT"
},
{
"created_ms": 1770211244623,
"state": "SUBMITTED_TO_WORKER"
},
{
"created_ms": 1770211244624,
"state": "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY"
},
{
"created_ms": 1770211244624,
"state": "RUNNING"
},
{
"created_ms": 1770211244699,
"state": "FINISHED"
}
],
"func_or_class_name": "Counter.increment",
"is_debugger_paused": null,
"job_id": "02000000",
"label_selector": {},
"language": "PYTHON",
"name": "Counter.increment",
"node_id": "3c5bf314a66f86ebfeee453d409b8318fa46f636a12a7a0a23feecab",
"parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
"placement_group_id": null,
"required_resources": {},
"runtime_env_info": {
"runtime_env_config": {
"eager_install": true,
"log_files": [],
"setup_timeout_seconds": 600
},
"serialized_runtime_env": "{}"
},
"start_time_ms": 1770211244624,
"state": "FINISHED",
"task_id": "e5cbd90b7f1fb776f6080df37a35848b2468441a02000000",
"task_log_info": null,
"type": "ACTOR_TASK",
"worker_id": "a095c2237c3c8d2bd1bd834117e4d4b89abf7fc904a1bf6cbda4af5d",
"worker_pid": 240
},
{
"actor_id": "",
"attempt_number": 0,
"call_site": null,
"creation_time_ms": 1770211244618,
"end_time_ms": 1770211244623,
"error_message": null,
"error_type": null,
"events": [
{
"created_ms": 1770211244618,
"state": "PENDING_ARGS_AVAIL"
},
{
"created_ms": 1770211244619,
"state": "PENDING_NODE_ASSIGNMENT"
},
{
"created_ms": 1770211244621,
"state": "RUNNING"
},
{
"created_ms": 1770211244623,
"state": "FINISHED"
}
],
"func_or_class_name": "Counter.__init__",
"is_debugger_paused": null,
"job_id": "02000000",
"label_selector": {},
"language": "PYTHON",
"name": "Counter.__init__",
"node_id": "",
"parent_task_id": "ffffffffffffffffffffffffffffffffffffffff02000000",
"placement_group_id": null,
"required_resources": {
"CPU": 0.5
},
"runtime_env_info": {
"runtime_env_config": {
"eager_install": true,
"log_files": [],
"setup_timeout_seconds": 600
},
"serialized_runtime_env": "{}"
},
"start_time_ms": 1770211244621,
"state": "FINISHED",
"task_id": "fffffffffffffffff6080df37a35848b2468441a02000000",
"task_log_info": null,
"type": "ACTOR_CREATION_TASK",
"worker_id": "",
"worker_pid": 240
}
],
"partial_failure_warning": "",
"warnings": null
}
}
}
Taking Follow-ups
1. Historical Replay
Currently, TODO: Preserve timestamped task states for each task attempt.
2. Filters and Query Parameters
Different state entities (e.g., nodes, tasks, actors) may have their own sets of filterable fields for GET APIs. Each entity could define its own filterable fields (similar to this example) while reusing the shared filtering helpers.
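One possible shape for the second follow-up, sketched with Go generics: each entity declares its own table of filterable fields, and a single shared helper applies a (key, predicate, value) triple against that table. The names `FieldGetters`, `ApplyFilter`, `taskFields`, and `nodeFields` are hypothetical, as are the trimmed `Task` and `Node` structs.

```go
package main

import "fmt"

// FieldGetters maps a filterable field name to an accessor for entity type T.
type FieldGetters[T any] map[string]func(T) string

// ApplyFilter is the shared helper: it keeps items whose field satisfies the
// (key, predicate, value) triple and rejects unknown keys or predicates.
func ApplyFilter[T any](items []T, fields FieldGetters[T], key, predicate, value string) ([]T, error) {
	get, ok := fields[key]
	if !ok {
		return nil, fmt.Errorf("field %q is not filterable", key)
	}
	if predicate != "=" && predicate != "!=" {
		return nil, fmt.Errorf("unsupported predicate %q", predicate)
	}
	var out []T
	for _, it := range items {
		// Keep on equality for "=", and on inequality for "!=".
		if (get(it) == value) == (predicate == "=") {
			out = append(out, it)
		}
	}
	return out, nil
}

// Task and Node are hypothetical, trimmed-down entities.
type Task struct{ TaskID, State, Type string }
type Node struct{ NodeID, State string }

// Each entity owns its list of filterable fields.
var taskFields = FieldGetters[Task]{
	"task_id": func(t Task) string { return t.TaskID },
	"state":   func(t Task) string { return t.State },
	"type":    func(t Task) string { return t.Type },
}

var nodeFields = FieldGetters[Node]{
	"node_id": func(n Node) string { return n.NodeID },
	"state":   func(n Node) string { return n.State },
}

func main() {
	tasks := []Task{{"a", "FINISHED", "ACTOR_TASK"}, {"b", "RUNNING", "NORMAL_TASK"}}
	nodes := []Node{{"n1", "ALIVE"}, {"n2", "DEAD"}}
	finished, _ := ApplyFilter(tasks, taskFields, "state", "=", "FINISHED")
	alive, _ := ApplyFilter(nodes, nodeFields, "state", "=", "ALIVE")
	fmt.Println(len(finished), len(alive))
}
```

Rejecting keys that are missing from the table gives each endpoint a clear error for unsupported filters without duplicating the matching loop.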
Once this PR goes through the final pass, I'll revert the first commit used for local dev. Thanks.
win5923 left a comment
LGTM! Thanks for your effort.
Future-Outlier left a comment
chatted with @JiangJiaWei1103 offline, LGTM, tks!
cc @rueian to merge, thank you!


Why are these changes needed?
This PR adds support for the /api/v0/tasks endpoint to the history server, making data structure definitions and event processing logic compatible with the ACTOR_TASK.
NOTE: We don't support the historical replay of the task state transitions in the alpha version. So, all lifecycle-related fields are gracefully downgraded to the last snapshot (i.e., overriding lifecycle-related fields in-place).
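To make the "last snapshot" behavior in the note concrete, here is a small sketch of applying lifecycle events in place. The `TaskSnapshot` and `LifecycleEvent` types are hypothetical, and the mapping from states to timestamp fields is an assumption inferred from the sample responses in this PR (creation time from `PENDING_ARGS_AVAIL`, start time from `RUNNING`, end time from `FINISHED`), with `FAILED` added as an assumed second terminal state.

```go
package main

import "fmt"

// LifecycleEvent is a hypothetical stand-in for one state transition.
type LifecycleEvent struct {
	State     string
	CreatedMs int64
}

// TaskSnapshot holds only the latest view of a task attempt.
type TaskSnapshot struct {
	State          string
	CreationTimeMs int64
	StartTimeMs    int64
	EndTimeMs      int64
}

// Apply overrides the lifecycle-related fields in place, so the snapshot
// always reflects the most recently applied event and keeps no history.
func (t *TaskSnapshot) Apply(ev LifecycleEvent) {
	t.State = ev.State
	switch ev.State {
	case "PENDING_ARGS_AVAIL":
		t.CreationTimeMs = ev.CreatedMs
	case "RUNNING":
		t.StartTimeMs = ev.CreatedMs
	case "FINISHED", "FAILED":
		t.EndTimeMs = ev.CreatedMs
	}
}

func main() {
	var snap TaskSnapshot
	for _, ev := range []LifecycleEvent{
		{"PENDING_ARGS_AVAIL", 1770211244350},
		{"PENDING_NODE_ASSIGNMENT", 1770211244350},
		{"SUBMITTED_TO_WORKER", 1770211244536},
		{"RUNNING", 1770211244536},
		{"FINISHED", 1770211244615},
	} {
		snap.Apply(ev)
	}
	fmt.Printf("%+v\n", snap)
}
```

The trade-off is that a query cannot ask what the task looked like at an earlier point in time, which is the gap the historical replay follow-up is meant to close.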
Change Summary
At a high level, this PR introduces changes across two main layers:
History Server Layer
- /api/v0/tasks endpoint
- exclude_driver and filter triple (key, predicate, value)
- detail
Event Server Layer
- TASK_DEFINITION_EVENT, ACTOR_TASK_DEFINITION_EVENT and TASK_LIFECYCLE_EVENT
- TaskMap, and expose a GetTasks helper for consumption by the history server layer
Test Result
Related issue number
Closes #4388.
Checks